14 research outputs found

    MinIE: minimizing facts in open information extraction

    Full text link

    Can we predict new facts with open knowledge graph embeddings? A benchmark for open link prediction

    Full text link
    Open Information Extraction systems extract ("subject text", "relation text", "object text") triples from raw text. Some triples are textual versions of facts, i.e., non-canonicalized mentions of entities and relations. In this paper, we investigate whether it is possible to infer new facts directly from the open knowledge graph without any canonicalization or any supervision from curated knowledge. For this purpose, we propose the open link prediction task, i.e., predicting test facts by completing ("subject text", "relation text", ?) questions. An evaluation in such a setup raises the question of whether a correct prediction is actually a new fact that was induced by reasoning over the open knowledge graph or whether it can be trivially explained. For example, facts can appear in different paraphrased textual variants, which can lead to test leakage. To this end, we propose an evaluation protocol and a methodology for creating the open link prediction benchmark OLPBENCH. We performed experiments with a prototypical knowledge graph embedding model for open link prediction. While the task is very challenging, our results suggest that it is possible to predict genuinely new facts which cannot be trivially explained.
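    A minimal sketch of how such a ("subject text", "relation text", ?) completion query might be scored with phrase embeddings. This is our own toy illustration, not the paper's actual model: the vocabulary, random vectors, and additive scoring function are all hypothetical stand-ins for learned embeddings.

    ```python
    import numpy as np

    # Toy token embeddings (hypothetical; a real model learns these from an OIE corpus).
    rng = np.random.default_rng(0)
    vocab = {w: rng.normal(size=8) for w in
             ["einstein", "was", "born", "in", "ulm", "berlin", "a", "physicist"]}

    def embed(text):
        """Average the token vectors of a free-text phrase."""
        toks = [vocab[t] for t in text.lower().split() if t in vocab]
        return np.mean(toks, axis=0)

    def score(subject, relation, obj):
        """Score a candidate completion: the closer subject + relation is to the
        object embedding, the higher the score (simple additive composition)."""
        return -np.linalg.norm(embed(subject) + embed(relation) - embed(obj))

    # Complete the open link prediction question ("einstein", "was born in", ?)
    # by ranking candidate object texts.
    candidates = ["ulm", "berlin", "a physicist"]
    ranked = sorted(candidates,
                    key=lambda o: score("einstein", "was born in", o),
                    reverse=True)
    print(ranked)
    ```

    The benchmark's point is that a high-ranking answer may still be a paraphrase of a training fact rather than genuinely new knowledge, which is why the leakage-aware evaluation protocol matters.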

    MinScIE: Citation-centered open information extraction

    Get PDF
    Acknowledging the importance of citations in scientific literature, in this work we present MinScIE, an Open Information Extraction system which provides structured knowledge enriched with semantic information about citations. By comparing our system to its original core, MinIE, we show that our approach improves extraction precision by 3 percentage points.

    Robust Text Classification: Analyzing Prototype-Based Networks

    Full text link
    Downstream applications often require text classification models to be accurate, robust, and interpretable. While the accuracy of state-of-the-art language models approximates human performance, they are not designed to be interpretable and often exhibit a drop in performance on noisy data. The family of Prototype-Based Networks (PBNs), which classify examples based on their similarity to prototypical examples of a class (prototypes), is natively interpretable and has been shown to be robust to noise, which enabled its wide usage for computer vision tasks. In this paper, we study whether the robustness properties of PBNs transfer to text classification tasks. We design a modular and comprehensive framework for studying PBNs, which includes different backbone architectures, backbone sizes, and objective functions. Our evaluation protocol assesses the robustness of models against character-, word-, and sentence-level perturbations. Our experiments on three benchmarks show that the robustness of PBNs transfers to NLP classification tasks facing realistic perturbations. Moreover, the robustness of PBNs is supported mostly by the objective function that keeps prototypes interpretable, while the robustness superiority of PBNs over vanilla models becomes more salient as datasets get more complex.

    Linking Surface Facts to Large-Scale Knowledge Graphs

    Full text link
    Open Information Extraction (OIE) methods extract facts from natural language text in the form of ("subject"; "relation"; "object") triples. These facts are, however, merely surface forms, the ambiguity of which impedes their downstream usage; e.g., the surface phrase "Michael Jordan" may refer to either the former basketball player or the university professor. Knowledge Graphs (KGs), on the other hand, contain facts in a canonical (i.e., unambiguous) form, but their coverage is limited by a static schema (i.e., a fixed set of entities and predicates). To bridge this gap, we need the best of both worlds: (i) high coverage of free-text OIEs, and (ii) semantic precision (i.e., monosemy) of KGs. In order to achieve this goal, we propose a new benchmark with novel evaluation protocols that can, for example, measure fact linking performance on a granular triple slot level, while also measuring if a system has the ability to recognize that a surface form has no match in the existing KG. Our extensive evaluation of several baselines shows that detection of out-of-KG entities and predicates is more difficult than accurate linking to existing ones, thus calling for more research efforts on this difficult task. We publicly release all resources (data, benchmark, and code) at https://github.com/nec-research/fact-linking

    EAL: A toolkit and dataset for entity-aspect linking

    Full text link
    We present a toolkit and dataset for entity-aspect linking. The tool takes as input a sentence and provides the most relevant aspect for each mentioned entity; it is implemented in Python and available as a script and via an online demo. It is accompanied by the first large dataset of entity-aspects, comprising more than 20,000 entities manually linked to the most relevant aspect, given a sentence as context. Each is expressed in a structured manner as Open Information Extraction (OIE) triples (Subject, Relation, Object), carrying semantic information for polarity, modality, quantity, and attributions.

    OPIEC: An open information extraction corpus

    Get PDF
    Open information extraction (OIE) systems extract relations and their arguments from natural language text in an unsupervised manner. The resulting extractions are a valuable resource for downstream tasks such as knowledge base construction, open question answering, or event schema induction. In this paper, we release, describe, and analyze an OIE corpus called OPIEC, which was extracted from the text of English Wikipedia. OPIEC complements the available OIE resources: It is the largest OIE corpus publicly available to date (over 340M triples) and contains valuable metadata such as provenance information, confidence scores, linguistic annotations, and semantic annotations including spatial and temporal information. We analyze the OPIEC corpus by comparing its content with knowledge bases such as DBpedia or YAGO, which are also based on Wikipedia. We found that most of the facts between entities present in OPIEC cannot be found in DBpedia and/or YAGO, that OIE facts often differ in the level of specificity compared to knowledge base facts, and that OIE open relations are generally highly polysemous. We believe that the OPIEC corpus is a valuable resource for future research on automated knowledge base construction.
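    A hypothetical sketch of what a metadata-rich OIE record of the kind the abstract describes might look like. The field names here are our own illustration, not OPIEC's actual schema; they only show how provenance, confidence, and semantic annotations can accompany a triple.

    ```python
    # Illustrative record structure (field names are hypothetical, not OPIEC's schema).
    record = {
        "triple": ("Albert Einstein", "was born in", "Ulm"),
        "confidence": 0.92,                                  # extractor confidence score
        "provenance": {"article": "Albert Einstein", "sentence_id": 1},
        "annotations": {"temporal": "1879", "spatial": "Ulm"},
    }

    def entity_pair(rec):
        """Return the (subject, object) pair of a record, e.g. to look up
        whether a KB such as DBpedia or YAGO contains a fact between them."""
        subj, _, obj = rec["triple"]
        return subj, obj

    print(entity_pair(record))  # ('Albert Einstein', 'Ulm')
    ```

    Grouping triples by their entity pair is one simple way to carry out the kind of corpus-vs-KB comparison the abstract reports.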

    On aligning OpenIE extractions with Knowledge Bases: A case study

    Get PDF
    Open information extraction (OIE) is the task of extracting relations and their corresponding arguments from natural language text in an unsupervised manner. Outputs of such systems are used for downstream tasks such as question answering and automatic knowledge base (KB) construction. Many of these downstream tasks rely on aligning OIE triples with reference KBs. Such alignments are usually evaluated w.r.t. a specific downstream task and, to date, no direct manual evaluation of such alignments has been performed. In this paper, we directly evaluate how OIE triples from the OPIEC corpus are related to the DBpedia KB w.r.t. information content. First, we investigate OPIEC triples and DBpedia facts having the same arguments by comparing the information in the OIE surface relation with the KB relation. Second, we evaluate the expressibility of general OPIEC triples in DBpedia. We investigate whether—and, if so, how—a given OIE triple can be mapped to a single KB fact. We found that such mappings are not always possible because the information in the OIE triples tends to be more specific. Our evaluation suggests, however, that a significant part of OIE triples can be expressed by means of KB formulas instead of individual facts.

    Compact open information extraction: methods, corpora, analysis

    Full text link
    Most existing data is stored in unstructured textual formats, which makes its subsequent processing by computers more difficult. The Open Information Extraction (OpenIE) paradigm aims at structuring the knowledge that is contained in text into more machine-readable formats. An OpenIE system (usually) extracts triples—("subject"; "relation"; "object")—from natural language text in an unsupervised manner, without having predefined relations. OpenIE extractions are used for improving deeper language-understanding tasks, including KB population, link prediction, and text comprehension. A common problem for such systems is that they often extract triples which contain unnecessarily detailed constituents. For instance, the phrases "the great Richard Feynman" and "Richard Feynman" have the same meaning, but the first phrase contains redundant words—"the" and "great"—that do not alter the meaning of the head phrase "Richard Feynman". Such redundant words pose difficulties for using OpenIE in downstream tasks, such as linking entities for KB population. In this thesis, we propose MinIE, an OpenIE system which aims to remove words from the triples that are considered to be overly specific without damaging the triple's semantics. The methods proposed in MinIE are domain independent and could in principle be integrated into any other OpenIE system. OpenIE extractions are most useful when they are available in large quantities. Our second contribution, therefore, is OPIEC, which is the largest publicly available OpenIE corpus to date (containing 341M triples). OPIEC was constructed from the entire English Wikipedia and it contains the links found in the Wikipedia articles, thus reducing ambiguity in certain cases. Such OpenIE triples with unambiguous arguments are useful for bootstrapping OpenIE extractors as well as for downstream tasks such as KB population. Our final contribution is an analysis of OPIEC.
    Such analysis is difficult to perform due to the openness and ambiguity of OpenIE extractions. Therefore, we compared the content of OPIEC with reference KBs (DBpedia and YAGO), which are not ambiguous and are also constructed from Wikipedia. Our analysis is (mostly) manual and reveals findings about semantic relatedness between OpenIE corpora and KBs, which are important for downstream tasks such as KB population (e.g., the study suggests that most knowledge found in OpenIE triples is relevant for the current KBs and is not present in the KBs).
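    A toy sketch of the kind of constituent minimization described above, using the abstract's own "the great Richard Feynman" example. The stop-list heuristic below is our own simplification, not MinIE's actual method, which relies on linguistic analysis and retains dropped information as annotations (polarity, modality, etc.).

    ```python
    # Words treated as redundant modifiers in this toy sketch (hypothetical list;
    # MinIE decides what to drop via linguistic analysis, not a fixed stop list).
    REDUNDANT = {"the", "a", "an", "great", "famous"}

    def minimize_phrase(phrase: str) -> str:
        """Remove redundant modifiers from one triple constituent."""
        kept = [w for w in phrase.split() if w.lower() not in REDUNDANT]
        return " ".join(kept)

    def minimize_triple(triple):
        """Minimize subject and object; leave the relation untouched."""
        subj, rel, obj = triple
        return (minimize_phrase(subj), rel, minimize_phrase(obj))

    print(minimize_triple(("the great Richard Feynman", "was born in", "Queens")))
    # → ('Richard Feynman', 'was born in', 'Queens')
    ```

    The interesting part of the real system is deciding which words are safe to drop without damaging the triple's semantics, which a fixed stop list cannot do in general.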